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ABSTRACT 

\ 

The  idea  of  grouping  the  non-zero  elements  of  a  sparse  matrix  into  few  stripes  that  are 
almost  parallel  is  applied  to  the  design  of  a  systolic  accelerator  for  sparse  matrix  operations. 
This  accelerator  is.  then,  integrated  into  a  complete  systolic  system  for  the  solution  of  large 
sparse  linear  systems  of  equations.  The  design  demonstrates  that  the  application  of  systolic 
arrays  is  not  limited  to  regular  computations,  and  that  computationally  irregular  problems 
may  be  solved  on  systolic  networks  if  local  storage  is  provided  in  each  systolic  cell  for 
buffering  the  irregularity  in  the  data  movement  and  for  absorbing  the  irregularity  in  the 
computation.  - - — 
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1.  INTRODUCTION 


In  [6]  and  [7],  stripe  structures  of  matrices  were  introduced  as  a  means  for  the  inclu¬ 
sion  of  all  the  non-zero  elements  of  a  sparse  matrix  into  a  regular  pattern  which  is  suitable 
for  parallel  computation.  This  stripe  approach  have  proved  to  be  useful  for  the  manipula¬ 
tion  of  sparse  matrices  on  data-driven  networks.  It  is  especially  attractive  for  the  type  of 
matrices  that  result  from  finite  element  analysis.  More  specifically,  the  number  of  stripes. 
i r.  in  finite  element  matrices  is  small  and  depends  only  on  the  method  used  for  the  discreti¬ 
zation.  That  is.  ir  is  independent  of  the  size  of  the  problem.  Algorithms  for  the  generation 
of  stripe  structures  for  sparse  matrices,  in  general,  and  for  finite  element  matrices,  in  par¬ 
ticular.  are  given  in  [6]  and  [8].  respectively.. 

However,  stripes,  as  defined  in  [6].  are  not  regular  enough  to  allow  for  the  manipula¬ 
tion  of  sparse  matrices  on  systolic  networks.  More  specifically,  a  particular  stripe  may. 
itself,  be  sparse.  In  data  driven  networks,  where  the  operation  of  each  cell  is  initiated  by 
the  availability  of  data,  the  sparsity  within  a  stripe  may  cause  execution  delays,  but  does 
not  affect  the  correctness  of  the  computation.  On  the  other  hand,  systolic  arrays,  which  are 
globally  synchronized,  are  characterized  by  the  property  that  the  location  of  each  data  item 
at  any  time  may.  and  should,  be  determined  before  the  beginning  of  execution.  This  is  not 
possible  if  stripes  are  sparse. 

Hence,  for  each  sparse  stripe  S ,  we  introduce,  in  Section  2,  a  non  sparse  structure 
which  we  call  the  complement  of  S .  Using  the  complements  of  the  stripes,  it  is  possible  to 
design  linear  systolic  arrays  for  the  multiplication  of  a  vector  by  a  striped  matrix.  These 
arrays  are  presented  in  Section  3.  along  with  an  analysis  which  determines  the  exact  timing 
of  the  input  data  required  for  the  correctness  of  the  computation. 

In  Section  4.  the  last  cell  in  the  matrix/vector  multiplication  array  is  modified  to  feed 
data  back  into  the  array  and  to  obtain  a  network  for  the  solution  of  triangular  linear  sys¬ 
tems.  The  feed  back,  however,  creates  a  data  dependence  which  may  lead  to  incorrect  com¬ 
putations.  A  condition  which  guarantees  the  correctness  of  the  computation  is  formulated 


in  terms  of  the  separation  between  the  stripes  of  the  matrix  and  the  rate  at  which  the  input 
data  is  supplied  to  the  network. 

Matrix/vector  multiplication  and  the  solution  of  triangular  linear  systems  are  the 
only  0(n2)  operations  in  each  iteration  of  the  preconditioned  conjugate  gradient  method 
(PCCG),  which  is  one  of  the  most  efficient  techniques  used  for  the  iterative  solution  of  large 
sparse  linear  systems.  Hence,  these  two  matrix  operations  should  be  the  primary  target  for 
acceleration  in  any  parallel  system  used  for  the  solution  of  large  linear  systems.  However, 
the  application  of  the  PCCG  involves  also  some  O  (n  )  operations,  as  for  example  vector 
addition  and  inner  product  computations.  These  O  (n  )  operations  may  not  be  ignored 
because  they  may  become  the  bottle-neck  of  the  parallel  system  (see  [4]  for  example).  A 
complete  systolic  system  which  accelerates  both  O  (n 2)  and  O(n)  operations  in  the  PCCG 
method  is  described  in  Section  5. 

2.  STRIPE  STRUCTURES  AND  THEIR  COMPLEMENTS 

In  [6],  a  stripes  S  in  a  matrix  A  *  {a;  j  :  i  ,j  =1.....»  }  is  defined  to  be  a  set  of  positions 
in  A  which  contains  at  most  one  position  for  each  row.  More  specifically. 

S  -  {(» .<x(t  ))  ;  i  c  i_dom  (S ).  and  <r(i  )  <  <r(f )  for  i  <  l  }  (1) 

where  i_dom  (5)  is  a  subset  of  {1,  •  •  •  ji )  and  cr  is  an  increasing  function  which  associates 
a  column  index  cr(i )  with  row  i .  An  alternative  definition  of  a  stripe  may  be  given  in 
terms  of  a  function  n  which  associates  a  row  index  fi(j)  with  column  j .  That  is 

S  =  KmO  )•/  )  :  j  €  j_dom  (S )  .and  n(j  )  <  fi(.m  )  for  /  <  m  }  (2) 

where  y_dom(S)  is  a  subset  of  {1,  •  •  •,«}.  Clearly,  if  i  €i_dom(S).  then 
cr(i  )  cj_dom  (5  )  and  i  =  )).  That  is  n  =  <r~l. 

Given  two  stripes  Si={(*  ,0'1(t  ))J  and  $2={(i  .<r2(i  ))}  the  ordering  relation  <  may  be 
defined  such  that  Sj  <  S2  if  ^(t )  <  )  for  i  <  j .  With  this,  a  stripe  structure  of  A 

is  a  set  of  stripes  LA  *  {Si,...  .S*}  such  that  Si<S2<--<S„  and  Si  U  (J  Sw  contains 
all  the  positions  of  the  nonzero  elements  in  A  .  That  is.  if  (i  ,j )  t  Sj  (J  •  ••  |J  Sw.  then 


Oj  j  *0.  ir  is  called  the  stripe  count  of  l.A .  As  an  example,  we  denote  in  Figure  1  the  zero 
and  nonzero  elements  of  a  matrix  by  V  and  "x\  respectively,  and  we  show  a  possible  stripe 
structure  for  that  matrix.  Note  that  the  positions  included  in  the  stripes  are  enclosed  in  cir¬ 
cles. 


S3  S4  S j 


From  the  definitions  (1)  and  (2).  it  is  not  essential  that  5  contains  a  position  for  every 
row  i  »1  ....ji  or  for  every  column  j  =l..../i ,  and  thus,  stripes  may  be  sparse.  Although 
stripes  which  are  sparse  introduce  some  irregularity  in  the  stripe  structure  of  matrices,  it  is 
shown  in  [6]  that  linearly  connected  arrays  of  cells  may  be  used  effectively  for  the  manipu¬ 
lation  of  striped  matrices,  provided  that  data-driven  synchronization  is  used.  In  other 
words,  it  is  possible  to  cope  with  the  sparsity  within  a  stripe,  if  the  operation  of  each  cell  is 
initiated  by  the  availability  of  data. 

On  the  other  hand,  if  the  array  is  synchronized  by  a  global  clock,  then  each  cell  is 
expected  to  receive  some  data  every  clock  cycle.  Hence,  sparse  stripes  should  be  augmented 
such  that  a  data  item  is  presented  to  the  network  each  clock  cycle.  One  way  of  augmenting 
sparse  stripes  is  to  include  in  S  a  position  for  each  row  of  A  .  In  this  case,  the  augmented 
stripe  is  called  the  row  complement  of  S  and  is  defined  by 

Sr  ■  {(» ,<J(i ));  i  «  ) 


where 


-4- 


and 


&U)>  &U- 1) 


*  *2....,n 


(3.b) 


Note  that  Sr  is  not  a  stripe  because  cr  is  not  strictly  increasing.  In  fact,  given  S .  it 
may  be  impossible  to  augment  <r  such  that  cr  is  strictly  increasing.  Although  conditions  of 
the  form  given  by  (3)  may  be  shown  to  ensure  the  correct  operation  of  systolic  networks  on 
striped  matrices,  it  should  be  clear  that  (3.a/b)  do  not  define  Sr  uniquely,  and  that  a 
unique  row  complement  of  5  may  be  only  obtained  if  <r  is  defined  in  a  more  restrictive 
manner.  In  this  paper,  we  adopt  the  following  unique  definition  of  cr: 


a(i)  = 


cKt) 

if  in  <  * 

a(i  +1) 

if  »n  <  * 

0"(‘n  )  “  (‘n  -»  ) 

if  i  <  in 

<r(»x  )  +  (i  —ix  ) 

if  i  >  ix 

(4) 


where  ix  =max{t :  i  e  i_dom  ( S  )}  and  =min{i ;  i  €  i_dom  (S  )|. 

Similarly,  the  column  complement  of  S  may  be  defined  by  adding  to  S  one  position 
for  each  column  j  4  j_dom  (S  ).  More  specifically, 

S'  *  ):  j  =  l.-ji } 


where 


f*(.j  )  =  Mj  )  if  j  €  j_dom  (S ) 


(5.a) 


and 


m(/’)  ^  1) 


(5.b) 


and  a  unique  definition  of  n  may  be  given  in  a  way  analogous  to  (4). 

Given  a  stripe  structure  ZA  =  the  row  and  column  complements  of  ZA  are 

defined,  respectively  by  ZA  =  {Sj . S* }  and  =  |Sj . Sc„).  In  Figure  2  we  show 

the  complements  of  the  structure  of  Figure  1.  Note  that  any  position  (i  .j )  in  and  ZA 
with  i  or  /  not  in  { 1 .  •  •  •  n  }  is  assumed  to  contain  a  zero. 


£ 


4 


Fig  2  -  The  complements  of  the  structure  of  Fig  1 
3.  SYSTOLIC  MULTIPLICATION  OF  A  VECTOR  BY  A  STRIPED  MATRIX 


3.1.  Using  column  complemented  stripe  structures. 

In  this  section,  we  assume  that  a  stripe  structure  ZA  =  {Si.  •  •  Sw)  is  given  for  a 
matrix  A  .  and  we  present  a  systolic  algorithm  for  the  multiplication  of  any  vector  x  by  A 
using  it  linearly  connected  systolic  cells  (see  Figure  3).  Each  systolic  cell  k  has  four  input 
ports  (denoted  /  j  -  I4)  and  two  output  ports  (denoted  O  j  and  02)-  It  also  contains  storage 
for  an  n-dimensional  array  SXk  which  is  initialized  to  zero.  In  brief,  the  elements  of  the 
stripe  Sk  are  multiplied,  in  cell  k .  by  the  corresponding  elements  of  x  and  the  results  are 
stored  in  SXk  .  The  partial  sums  are  then  accumulated  across  the  cells  1.  •  •  •  ,7r. 

In  order  to  describe  the  algorithm,  we  let  7  be  the  time  defined  by  the  global  clock 
(number  of  cycles  since  the  start  of  operation),  and  we  assume  that  the  elements  of  the  vec¬ 
tor  x  are  applied  to  port  /  j  of  cell  it  starting  at  time  7  =  1.  and  that  each  cell  transmits  the 
content  of  Ix  to  0\  in  one  cycle.  We  also  assume  that  each  cell  k  executes  ir—k  trivial 
cycles  before  it  starts  receiving  actual  data  and  performing  useful  operations,  for  simpli¬ 
city,  we  let  tk  =7  — (rr— *  )  be  a  local  time  defined  at  cell  k  . 

At  any  given  local  cycle  tk  of  cell  k  ,  the  element  o  ^  (,  of  the  complemented  stripe 
Sk  is  supplied  to  1 3.  and  the  row  index  fik  (tk  )  is  supplied  to  / 4.  During  the  same  cycle,  the 


-6- 


(a)  Unidirectional  data  flow 


(b)  Bidirectional  data  flow 
Fig  3  -  Systolic  striped  matrix/vector  multiplication 

corresponding  element  xtk  sbould  arrive  at  I  x.  thus  allowing  for  the  formation  of  the  pro¬ 
duct  term  <*jikuk)tk*  x'» ■  ThL5  term  is  then  stored  in  SXk  [jttt  (tk  )].  Here,  we  note  that  if 
f*  e  j_dom  (St ).  then  fxk  (tk  )  =  nk  (tk  ).  and  hence,  <*jikukuk  *  xtk  “  ai.<rkU)  *  x<rk(>)-  where 
* 

The  elements  y  i.  ■  •  •  ,yn  of  the  y  data  stream  .initialized  to  zero,  are  propagated 
through  the  network  either  in  the  same  direction  as  the  x  data  stream  (see  Fig  3a),  or  in  the 
opposite  direction  (see  Fig  3b).  In  either  case,  when  an  element  y*  arrives  at  cell  k  ,  it  picks 
up  the  corresponding  term  stored  in  SXk  [i  ].  But  SXk  [i  ]=  at  tTk  (i  )*  xGk  (i  >  if  Sk  contains  a 
position  at  row  i .  and  SXk  [t  ]=0  otherwise.  Hence,  for  any  specific  i .  y,  visiuall  of  the  n 
cells  and  accumulates  the  sum  of  the  terms  a*  j  Xj .  for  which  (t  ,j  )tSk  for  some  k  .  Given 

n 

that  ctj  j  »0  if  there  is  no  k  such  that  (i  .j  )eSk  .  we  conclude  that  y*  =  £  a*  ;  xt .  the  i,h 

j=  i 

element  of  the  product  A  x  . 

Both  the  x  and  y  data  streams  may  flow  in  the  network  simultaneously.  However,  it 
is  important  to  ensure  that  an  element  y,  will  not  arrive  at  a  cell  k  before  the  term 


ai.<rkU)  *  x<rk(i)  is  computed  and  stored  in  SXk  [i  ].  With  this  in  mind,  the  operation  of  cell 
k  in  the  network  may  be  described  by  the  algorithm  of  Fig  4.  in  which  the  y  stream  is 
assumed  to  arrive  at  cell  k  lagging  behind  the  x  stream  by  8k  cycles.  For  simplicity,  we 
assume  that  storage  is  allocated  (and  initialized  to  zero)  in  each  cell  for  SXk  [i  ], 
i  =—8k  .  •  •  •  .0.  and  that  x*  =0  for  i  =rt  +1.  •  •  •  ji  +8k  . 

Starting  at  time  T  =  v—k  +1  DO  J 

FOR  cycles  tk  =  1.  ••  •  ji  +8*  DO 

1)  Read  x,k  from  1 1  into  £  and  y,k  ~&k  from  1 2  into  T). 

2)  Read  a  -k  ^  •jJk  from  I3  into  a  and  nk  (r*  )  from  1 4  into  X. 

3)  SXk  [A]  ^  a*  £ 

4)  i)  t)  +  SXk  U*  — 8k  ] 

_ 5)  Write  £  and  t)  on  O  k  and  P2.  respectively. _ 

Fig  4  -  The  operation  of  cell  k 

The  next  goal  is  the  determination  of  the  value  of  8t  which  ensures  that  SXk  [rt  — 8t  ] 
is  computed  in  some  cycle  tk‘  <  tk  .  In  order  to  be  able  to  apply  our  results  on  two  leveled 
pipelined  systolic  machines  [l],  we  assume  that  the  multiplication  and  the  addition  in  each 
cell  are  performed  on  pipelined  units  with  p'  and  stages,  respectively.  In  this  case,  the 
results  of  step  3  in  Fig  4  is  stored  in  SXk  at  cycle  tk+p* .  This  means  that  at  cycle  tk .  the 
values  of  SXk  [l],  •  •  •  £Xk  [  fik  (rt  )]—/>’  are  already  computed  and  thus  we  must  have 

Which  is  satisfied  if 

8*  ^  j  —  fik  (y  )  +  p’  for  any  j  e  j_dom  ( Sk  )  (6.a) 

Equation  (6.a)  specifies  the  minimum  value  of  8k  required  in  cell  k  to  ensure  correct 
operation.  However,  the  elements  of  the  y  data  stream  flow  in  the  cells  of  the  network  in 
an  orderly  manner,  which  implies  some  relation  between  the  values  8k  .  k  =1 . n. 


In  order  to  be  more  specific,  we  consider  the  case  of  bidirectional  data  flow  (Fig  3b).  If 
yt  and  ym  are  at  cells  k  and  k  +1.  respectively,  during  the  same  global  cycle  T ,  then 
l  —  tk  —8k  =  T  —TT+k  —&k  and  m  =  tk+l— 8t+1  =  T— ir+k  +1— 8t+i.  But  the  y  stream  has 
to  travel  through  the  addition  pipeline  units,  and  hence  l  =  m+p*.  which  gives 

8t+1  *  8*  +P*  +  1  (6  b) 

Hence.  6*  depends  on  8j,  which,  in  turns,  depends  on  the  time.  Ts[arl  ^  .  at  which  the 
input  y  \.y2-  •  ■  •  is  initiated  at  port  1 2  of  cell  1.  More  specifically.  Tuart  ^  is  the  global  time 
T  corresponding  to  the  local  time  t  x  =  8j+l  at  cell  1.  This  gives 

Si  =  *  start  j  -  *  (6  c) 

Clearly,  it  is  possible  to  satisfy  (6.a)  by  taking  Taart  ^  =  n  +ir+p‘ ,  which  means  that  the  y 
stream  is  applied  into  the  network  after  the  x  stream  exits  it.  thus  ensuring  that  SXt  [i  ] 
will  contain  the  correct  values  upon  the  arrival  of  y, .  This  two-phase  approach,  however, 
has  two  disadvantages,  namely.  1)  if  each  cell  is  capable  of  performing  both  a  multiplica¬ 
tion  and  an  addition  in  each  cycle  (as  is  the  case  with  the  Warp  [l]).  then  the  cells  are 
under-utilized  because  only  a  multiplication  is  performed  in  the  first  phase,  and  only  an 
addition  is  performed  in  the  second  phase,  and  2)  the  two  phase  approach  is  not  applicable 
to  the  solution  of  triangular  linear  systems,  in  which  the  values  in  the  x  stream  depends  on 
tv  results  obtained  in  the  y  stream. 

Fortunately,  we  do  not  have  to  wait  until  the  x  stream  exits  the  network  before  we 
input  the  y  stream.  More  specifically,  it  is  straight  forward  to  check  that  the  condition 
(6.a)  is  satisfied  if 

^ start  jf  =  B2  +  P‘  +1T  (7) 

where  ^  the  upper  band-width  of  the  matrix  A  ,  which  is  necessarily  larger  than  j—i 
for  any  }  ^ 0.  and  thus,  larger  than  7— Mi  (/)  for  *ny  7=1 n  and  ^=1 ir-  More¬ 
over,  it  may  be  shown  that  TItartJ  given  by  (7)  is  the  minimum  starting  time  for  the  y 
stream  which  will  always  guarantee  the  correct  operation  of  the  network.  With  this 


starting  time,  any  specific  y*  appears  on  port  0 2  of  cell  ir  at  time  i +B 2+(p++l)Tr+p‘ . 
Hence,  the  matrix/vector  multiplication  may  be  computed  in  n  +2? 2+(p++l )ir+p*  systolic 
cycles. 

The  same  type  of  analysis  may  be  applied  to  the  case  where  the  y  data  stream  flows  in 
the  same  direction  as  the  x  stream.  In  this  case,  however,  equation  (6.b)  is  replaced  by 
8*  =*  8t+1+/>++l.  and  equation  (6.c)  is  replaced  by  8„  =  Tstarl  y—  1.  The  minimum  starting 
time  for  the  y  stream  and  the  execution  time  of  the  network  may  then  be  found  to  be 
B 2+p’  +1  and  n  +B2+p+ir+p’  +1  respectively. 

Finally,  it  should  bt  noted  that  at  any  particular  cycle  tk  of  cell  k  .  only  the  locations 
SXk  [//*  [r*  ] . SXk  [r*  —8*  ]  of  SXk  may  be  occupied.  Hence,  the  n  dimensional  array  SXt 

may  be  replaced  by  a  circular  buffer  of  length  ma x[(ik  (/)— j  +St  }  -  B  l+B2+2p‘ ,  where 

*  .j 

B 1  is  the  lower  band-width  of  A  .  It  satisfies  Bk  ^  maxl/r*  ( j  )—j }. 

*  j 

3.2.  The  multiplication  of  x  by  A  T 

The  exact  same  networks  described  in  the  previous  section  may  be  used  for  the  multi¬ 
plication  of  x  by  the  transpose  of  the  matrix  A  .  Namely,  if  the  row  complement  of  Sk 
rather  than  its  column  complement  is  supplied  to  cell  k  in  the  networks  of  Fig  3,  then  the 
y  streams  will  accumulate  the  elements  of  the  product  vector  A  T  x  . 

In  order  to  be  more  specific,  we  assume  that  each  cell  k  in  the  networks  of  Fig  3  exe¬ 
cutes  the  algorithm  of  Fig  4  after  the  replacement  of  step  2  by 

2)  Read  from  /3  into  a  and  cr4  ( tk  )  from  /4  into  X. 

With  this  replacement,  the  term  atk.Vi(tk)*  xtk  >s  computed  at  cycle  tk  and  stored  in 
SX*  [<ft  (r*  )].  But.  if  tk  ei_dom(Sk).  then  cr*  (rt. )  =  cr*  (r*  )  and  a(*  .vk(tk )  *  x,t  - 
ai--kU)j  *  ***(/)•  where  j  =  <r*  (rt ).  Hence,  for  any  j=\.--,n.  SX*!/]  contains 
U  *  x n„ O' )  if  (f*(  j  )■/  )€5*  .  and  contains  zero  otherwise. 


Now.  each  element  y}  in  the  y  data  stream  visits  every  cell  k  and  picks  up  from  it 

the  content  of  SXk  [j  ].  Upon  exiting  the  network.  y} ,  thus,  contains  the  sum  of  the  terms 

\ 

«i  j  Xj  for  which  (i  ,j  )eSt  for  some  k  .  But.  a{  j  = 0  if  (t  ,j  )  does  not  belong  to  any  stripe. 

n 

Hence,  at.j  *«» the  )th  element  of  Ar  x . 

i  *1 

The  value  of  the  delay  factor  8*  which  ensures  the  correct  operation  may  be  obtained 
by  an  analysis  identical  to  the  one  described  in  Section  3.1.  This  analysis  gives  a  condition 
on  8*  similar  to  (6.a).  Namely 

Sk  ^  i  -  <r(i )  +  p’  for  any  i  €  i_dom  (St  )  (8) 

Assuming  that  the  y  stream  flows  in  the  direction  opposite  to  that  of  the  x  stream, 
then  both  (6.b)  and  (6.c)  hold.  The  minimum  starting  time  for  the  y  data  stream  is.  then, 
found  to  be 

^  start  j  ~  B{+  p’  +  7T  (9) 

where  B  x  is  the  lower  band-width  of  the  matrix  A  .  In  this  case,  the  execution  of  the  net¬ 
work  requires  n  +B  t+y  cycles,  where  y=(/>++l)7r+/>’ . 

33.  The  case  of  symmetric  matrices 

Let  H  be  a  symmetric  matrix  which  may  be  decomposed  into  A  +  D  +  A  r .  where  A 
is  a  strictly  lower  triangular  matrix  and  D  is  a  diagonal  matrix.  Let  also 
E*  *  {Sj.  •  •  •  be  a  stripe  structure  for  A  and  ZD  =  fSw}  be  the  stripe  structure  of 

D  .  where  S„  contains  all  the  diagonal  positions  (t  ,i  ),  i  .  Clearly,  the  matrix  H  has 

2tr— 1  stripes,  and  hence,  the  multiplication  of  any  vector  x  by  H  may  be  performed  on  a 
systolic  network  with  2ir— 1  cells  in  approximately  n  +B2  cycles,  as  described  in  the  previ¬ 
ous  sections. 

However,  the  goal  of  this  paper  is  to  use  systolic  networks  in  the  iterative  solution  of 
linear  systems.  This  involves,  besides  matrix/vector  multiplication,  the  solution  of  triangu¬ 
lar  linear  systems.  In  Section  3.  it  is  shown  that  the  solution  of  triangular  systems  may  be 


performed  on  a  systolic  network  composed  of  i r  cells.  Hence,  if  the  multiplication  y  =Hx 
may  also  be  performed  using  only  tr  cells,  then  the  same  systolic  network  may  be  used  for 
both  operations,  thus  increasing  the  utilization  of  the  network,  and  the  efficiency  of  the 
entire  solution  process. 

The  reduction  of  the  number  of  cells  used  in  the  multiplication  y  =  H  x  may  be 
accomplished  by  first  computing  the  partial  result  vector  z  »  (A  +D  )  x  .  and  then  comput¬ 
ing  y  *z  +  A  T  x .  Given  that  the  stripe  count  of  A  is  it—  1.  and  that  of  D  is  one.  then, 
the  two  operations  may  be  performed,  sequentially,  on  a  network  of  it  cells.  This,  how¬ 
ever.  requires  that  the  partial  result  vector  z  be  stored  outside  the  network. 

It  is  possible  to  avoid  the  external  storage  of  2  if  the  two  operations  (A  +  D  )x  and 
AT x  are  interleaved.  In  this  case,  the  internal  storage  SXk  in  cell  k .  1  <ir.  may  be 
used  to  hold  temporarily  the  product  of  stripe  Sk  by  x  (generated  during  the  multiplication 
(A  +Z)  )x  ).  and  then  to  accumulate  the  product  of  the  transpose  of  Sk  by  x  (generated  dur¬ 
ing  the  multiplication  Arx).  Clearly,  cell  ir  is  only  responsible  for  the  multiplication  of 
Sw  by  x  and  hence  no  interleaved  operation  is  needed  for  that  cell. 

A  more  precise  description  of  the  operation  of  a  cell  k  .  <ir,  is  given  in  Fig  5. 
Note  that  each  execution  of  the  body  of  the  FOR  loop  corresponds  to  two  systolic  cycles, 
where  in  any  specific  cycle,  one  addition,  one  multiplication,  and  at  most  one  read/write 
from/to  each  port,  may  be  performed. 

It  is  important  to  ensure  that  the  content  of  a  location  SXk  [X],  which  is  computed  at 
step  2.2  in  any  given  cycle,  is  not  overwritten  by  the  execution  of  step  1.3  in  any  future 
cycle.  In  other  words,  for  any  given  tk  .  X  =  &k  (t*  )  should  be  smaller  than  X  *  fik  (rt  ). 
This  condition  is  always  satisfied  for  any  lower  triangular  matrix  A  . 

As  clear  from  the  algorithm  of  Fig  3,  the  x  and  y  streams  travel  in  the  network  at  a 
speed  of  one  cell  every  two  cycles.  The  value  of  the  delay  8*  should  satisfy  both  condi¬ 
tions  (6.a)  and  (7).  Assuming  that  the  two  data  streams  flow  in  opposite  directions,  then 
the  minimum  starting  time  for  the  y  data  stream  is  the  largest  of  rnar,  J  and  faarl  j  given 
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Starting  at  time  T  -  ir—k  +1  DO 
FOR  tk  «  1.  •  -  a  +«*  DO 

Cycle  1:  1.1)  Read  x,k  from  / i  into  i  and  y ,k-t>k  from  1 2  into  7). 

1.2)  Read  atk  Vk(,k )  from  I3  into  a  and  ff*  C tk  )  from  / 4  into  X. 

1.3)  SXk  [X]  «-  <x*£ 

1.4)  7)  t)  +  SXk  [f*  Sk  ] 

Cycle  2  :  2.1)  Read  o^kuk)/k  from  h  int0  a  and  )  from  /4  into  X. 

2.2)  SXk  [X]  -  SXk  [X]  +  a  *  £ 

2.3)  Write  (  and  7)  on  O  t  and  02.  respectively. _ 

Fig  3  -  the  multiplication  of  a  vector  by  a  symmetric  matrix 
by  (7)  and  (9).  respectively.  However.  A  is  a  lower  triangular  matrix  for  which  Z?2=0. 

Hence,  the  minimum  starting  time  for  the  network  is  given  by  (9).  and  the  computation  of 
Hx  terminates  in  2(n  +5j+y)  cycles,  where  ■y=(/>++l)7r+/>‘  is  a  constant  which  depends 
on  the  number  of  cells  and  the  architecture  of  each  cell. 

4.  THE  SOLUTION  OF  TRIANGULAR  LINEAR  SYSTEMS 

In  this  section,  we  consider  lower  triangular  linear  systems  of  the  form 

i 

(  A  +  D  )  x  *6  i 

where,  as  in  Section  3.3.  A  is  strictly  lower  triangular  with  stripe  structure  ZA  = 

fl 

{S j,....5ir_]}.  and  D  is  diagonal  with  stripe  structure  1D  =  {S*}.  The  solution  of  such  sys¬ 
tems  may  be  computed  on  any  of  the  two  systolic  networks  shown  in  Figure  3.  For  simpli¬ 
city.  we  start  by  assuming  that  the  network  has  bidirectional  data  flow  (Fig  3b). 

The  operation  of  the  network  is  similar  to  that  of  the  original  forward  substitution 
network  of  Kung  and  Leiserson  [5],  However,  the  same  techniques  used  in  Section  3.1  are 
applied  such  that  each  cell  deals  with  a  stripe  rather  than  a  diagonal.  More  specifically,  the 

first  tt— 1  cells  of  the  network,  namely  cells  X  =1 .  •  ■  •  ,n—  1,  perform  the  matrix/vector  j 
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multiplication  y  ai4x  starting  at  time  rx .  That  is  the  elements  of  x  are  supplied  to  port  I  \ 
of  cell  ir— 1  starting  at  time  rx .  Subsequent  analysis  will  show  that  rx  -  p*+p‘  +1. 


Hence,  the  operation  of  each  cell  k  .  1  ^ir— 1,  is  described  by  the  same  algorithm  of 
Fig  4.  except  that  execution  starts  at  time  T  -  ir—k  — 1+r, .  This  change  in  the  starting 
time  does  not  affect  the  conditions  (6.a)  and  (6.b)  that  should  be  satisfied  in  order  to  ensure 


correct  operation  and  data  flow. 


The  last  cell  in  the  network,  namely  cell  ir.  is  responsible  for  recursively  computing 
the  i,h  element  of  x  from  the  i,h  element  of  y  -Ax .  Assuming  that  the  elements  of  the 


right  hand  side  vector  b  are  applied  to  port  I\  of  cell  ir  starting  at  time  7=1.  and  noting 
that  fJLj.i  )  *  i .  we  may  describe  the  operation  of  cell  ir  by  the  following  Algorithm: 


FOR  cycles  tw  ■  1.  •  •  •  ji  DO 


/*  Here  T  -t*/ 


1)  Read  (  1  )  from  J3  into  a 

2)  Read  b,w  from  7j  into  £.  and  y,w  from  I2  into  i). 


3U  -  U-f))*  a 


4)  Write  (  on  O  3  and  02.  /*  02  is  considered  the  output  of  the  network  */ 


First,  we  note  that  we  avoided  the  division  operation  in  cell  ir  by  supplying  to  the  cell 


the  reciprocals  of  the  diagonal  elements,  which,  in  this  case,  should  be  computed  outside  the 
network  (by  the  host  for  example).  Relieving  cell  ir  from  performing  the  division  opera¬ 


tion  allows  the  execution  rate  of  all  the  cells  in  the  network  to  be  equal,  but  increases  the 


load  on  the  host.  However,  during  the  iterative  solution  of  linear  systems  the  same  tri¬ 
angular  system  is  solved  in  each  iteration  for  a  different  vector  b .  Given  that  we  usually 
iterate  hundreds  of  times,  the  one  time  increase  in  the  load  of  the  host  is  justifyable. 


Assuming  that  multiplication  and  subtraction  (equivalent  to  addition)  in  cell  ir  are 
pipelined,  it  is  clear  that  the  value  of  x<  appears  on  port  O  2  of  cell  ir  (and  thus  on  /  j  of  cell 
ir— 1)  at  global  cycle  T  sp*+p‘  +i .  During  that  same  global  cycle.  yfi*+2/>»+i  should  be  at 
port  I2  of  cell  ir— 1.  Hence,  from  the  algorithm  of  Fig  4.  we  find  that  the  value  of  8„_i  at 
cell  ir— 1  is  equal  to  —ip’  +  2p+).  and  thus,  from  (6.b) 
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8*  -  -  C  P*  -2  +  (.ir—k  +1)  (/>++l)  )  *  =1.  •  •  ,ir— 1  (10) 

In  other  words.  8t ,  k  =1 . tt— 1.  are  completely  determined  by  the  data  flow.  How¬ 

ever.  condition  (6  .a)  should  still  be  satisfied  in  order  to  ensure  the  correct  operation  of  cells 
1.  •  •  ■  ,ir— 1.  With  (10)  this  condition  becomes: 

M*(/)~y  ^  2  (p*-l)  +  (ir-t +1)  (p++l)  *=l.  -,ir- 1  (11) 

which  basically  specify  that  stripe  Sk  should  be  separated  from  the  diagonal  by  at 
least  the  right  side  of  the  inequality.  Hence,  the  network  operates  correctly  only  if  the 
stripes  of  the  input  matrix  satisfy  condition  (11).  In  this  case,  th-  last  element  of  the  solu¬ 
tion  vector  x  is  computed  at  global  time  T  ~rx  +n  ,  that  is  the  network  terminates  its  exe¬ 
cution  after  p++p’  +l+n  cycles. 

For  example,  assume  that  both  the  multiplication  and  the  addition  are  performed  on 
5-stage  pipelined  units,  that  is  p+  =  p*  =  5.  Hence,  equation  (11)  indicates  that  the  first 
lower  stripe,  namely  S„_i  should  be  away  from  the  diagonal  by  at  least  20  positions,  the 
second  lower  stripe.  S,_2  by  at  least  26  positions,  and  so  on.  Multicolor  numbering  tech¬ 
niques  are  available  [8.  9].  which  rearrange  the  rows  and  columns  of  the  matrix  such  that  to 
satisfy  this  type  of  stripe  separation. 

The  stripe  separation  condition  (11)  may  be  relaxed  if  the  data  in  the  input  streams  to 
the  network  are  spread  appropriately.  More  specifically,  if  the  data  at  any  input  port  are 
applied  at  the  rate  of  one  data  item  every  9  cycles,  rather  than  every  cycle,  then  it  may  be 
shown  that  9  8*  will  appear  in  the  left  side  of  equation  (10),  and  thus  the  condition  (11) 
becomes 

0  0**0)-/)  >  2(/>*-l)  +  (ir-*+l)(p++l)  *=1.  ••  ,ir-l  (12) 

which  may  always  be  satisfied  by  the  proper  choice  of  9.  Clearly,  the  execution  time  in  this 
case  increases  to  n 9+p‘  +p++ 1  cycles. 

Although  we  considered,  so  far.  only  networks  with  bidirectional  data  flow,  it  is 
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equally  possible  to  solve  triangular  systems  on  networks  with  unidirectional  data  flow. 
For  this,  the  y  data  stream  is  initiated  at  port  12  of  cell  v—1.  and  the  result  of  Ax  is  fed 
back  from  port  O  2  of  cell  1  to  port  1 2  of  cell  ir  (see  Fig  6).  The  operation  of  each  cell  in  the 
network  remains  the  same,  but  the  value  of  8*  may  be  found  to  be  8*  =—(/>’  —  (*  +l)p+). 
Correct  operation  of  the  network,  in  this  case,  requires  that  the  stripes  of  the  input  matrix 
satisfy  fik  ( j  )—j  >  ( k  +l)p+  rather  that  (11). 


Fig  6  -  Unidirectional  data  flow  with  feed  back 

Finally,  it  should  be  noted  that  the  solution  of  upper  triangular  systems  of  the  form 
(A T  +D  )x  =  b  are  very  similar  to  the  solution  of  lower  triangular  systems.  Hence,  the 
same  networks  may  be  used  and  the  same  analysis  techniques  may  be  applied. 

5.  APPLICATION  TO  THE  PRECONDITIONED  CONJUGATE  GRADIENT  METHOD 

The  preconditioned  conjugate  gradient  method  (PCCG)  [2]  is  one  of  the  best  known 
iterative  method  for  the  solution  of  large  sparse  linear  systems  of  the  form  Hx  -f  .  where 
H  is  an  n  Xn  symmetric,  positive  definite  matrix.  Each  iteration  of  the  PCCG  involves  the 
multiplication  of  a  vector  by  H  and  the  solution  of  two  triangular  systems  of  the  forms 
(A  +D  )x  =8  and  (A  T  +D  )h  =x  .  where  A  is  a  strictly  lower  triangular  matrix  derived 
from  H .  Denoting  by  <x  ,y  >  the  inner  product  of  two  vectors  x  and  y  .  the  PCCG  may 
be  described  as  follows: 

Choose  an  initial  guess  x0. 

r0'*f—Hx0  :  h*M~lr0  ;  y0sl<r0Ji>  :  0=0 

Repeat  for  i  =0.  •  •  •  until  y/  ^  €.  where  €  is  an  acceptable  error 

1)  1.1)  ^  =  h  +  0  p,  _! 

1.2)  y  Pi 
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1.3)  a  «  <y . Pi  >. 

2) 

2.1)  a  *  —  . 
a 

3) 

31)*i+i»*i  +api 

3*2) ri+1  *  rt  -  ay 

3.3)  Solve  (A  +D  )b  *  r4  +1 

4) 

4.1) Solve (AT  +D)h  -6. 

4*2)  i  ■  > 

5) 

5. 1)0«^-. 

Yi 

If  H  is  irregularly  sparse,  then  A  is  usually  chosen  such  that  (A  +D  +Ar)  has  the 
same  sparsity  structure  as  H .  and  thus  the  same  stripe  structure.  Let 
t,H  *  {Sf,  •  •  •  Sw.  •  •  •  S2w- i)  be  a  stripe  structure  for  H  such  that  Sw  is  the  diagonal 
{(*  i  )  :  i  *l.....n  }.  Hence.  ZA  *  {Sj.  •  • 

In  Fig.  7.  we  show  a  complete  systolic  system  for  the  execution  of  the  PCCG.  The  rec¬ 
tangular  block  labeled  ARRAY  in  the  figure  is  a  systolic  array  consisting  of  ir  cells.  This 
array  is  used  for  the  execution  of  steps  1.2,  3.3  and  4.1  in  each  iteration  of  the  PCCG.  Each 
cell  in  ARRAY,  however,  is  slightly  difiFerent  from  the  cell  shown  in  Fig  3.  More 
specifically,  the  input  ports  /3  and  I4  in  each  cell  in  Fig  3  are  omitted  in  ARRAY,  and  the 
data  which  is  supplied  on  these  ports,  namely  the  elements  of  the  stripes  of  H  and  A  .  is 
stored  in  local  memories.  In  other  words,  each  cell  k  in  ARRAY  has  enough  memory  to 
store  both  the  row  and  column  complements  of  stripe  Sk  for  both  H  .  and  A  .  This  requires 
six  n  -dimensional  arrays.  The  in-cell  storage  of  the  stripe  relieves  the  host  from  supplying 
the  same  information  to  the  cells  in  each  iteration. 

The  other  operations  in  the  PCCG  are  essentially  vector  and  scaler  operations.  A  sys¬ 
tolic  cell  similar  to  the  ones  used  in  ARRAY  is  added  to  the  system  to  perform  most  of 
these  operations.  This  cell  is  denoted  by  ir+ 1  in  Fig  7.  In  addition  to  multiplication  and 
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v+l 


ARRAY 


Host 


Fig  7  -  A  systolic  system  for  PCCG 


addition,  cell  ir+1  should  be  able  to  perform  division  in  either  software  or  hardware.  The 


time  for  division  is  not  crucial  because  only  two  division  operations  are  performed  in  each 


PCCG  iteration,  namely  in  steps  2.1  and  2.5.  Also,  cell  ir+ 1  is  assumed  to  have  local 


storage  for  a  temporary  array  SXw+t.  and  for  the  scalers  a.  0.  y,  and  y,  +1. 


The  host  in  Fig  7  is  a  general  purpose  computer  which  initializes  the  systolic  cells  and 


initiates  and  controls  each  iteration  step.  The  vectors  ft-i.  rt  and  x,  reside  in  the  host.  At 


each  iteration,  the  host  supplies  cell  ir+ 1  with  the  vectors  p,-\  and  r, .  and  receives  from  it 


the  vectors  Pi  and  rj+  x  and  the  scalers  y1+1  and  a.  It  is  also  the  responsibility  of  the  host  to 


compute  x(  +i  in  step  3.1.  and  to  check  the  value  of  yi+j  for  convergence. 


Initial  content  of 
Host  -*  cell  ir+1 
ceil  ir+ 1  -  ARRAY 
ARRAY  -*  cell  ir+1 
cell  ir+1  -*  Host 
Final  content  of  SXV+ 1 


Execution  in  ARRAY 
Execution  in  cell  ir+1 
Execution  in  Host 


Execution  time 


Table  1  -  Summary  of  communication  and  computation  steps 


The  various  computations  assigned  to  each  unit  in  the  system  and  the  data  movement 


between  units  are  summarized  in  Table  1.  Each  column  in  the  table  corresponds  to  one  step 


in  a  PCCG  iteration.  Data  movement  is  specified  in  the  first  six  rows  of  the  table,  and  com¬ 


putational  activities  are  specified  in  the  following  three  rows. 


Given  Table  1.  it  is  possible  to  trace  the  execution  of  the  system  for  each  step  in  a 
FOOG  iteration.  For  example,  during  the  first  step,  the  host  sends  the  elements  of  p,  _i  to 
cell  ir+1  at  a  rate  of  one  element  every  two  cycles.  When  cell  ir+1  receives  the  Ith  element 
of  Pi-\.  namely  Pi-J.1 ].  it  computes  pt  [Z ]*h  [/ ]  (note  that  h[l  J  is  initially  stored 
in  SXT+1[Z  ]),  and  sends  the  result  to  both  ARRAY  and  the  host.  ARRAY  receives  the  ele¬ 
ments  of  Pi  and  returns  to  cell  ir+1  the  elements  of  the  product  vector  y  —Hp[ .  at  the  same 
rate.  Given  that  cell  ir+1  is  busy  executing  step  1.1  only  every  second  cycle,  the  idle  cycles 
may  be  used  for  the  computation  of  step  1.3.  That  is.  whenever  celi  ir+l  receives  an  ele¬ 
ment  y  [Z  ]  from  ARRAY,  it  accumulates  its  contribution  to  a.  that  is.  it  computes 
ar*a+y  [/  ]  *  p,  [l  ].  before  storing  y  [Z  ]  in  SX,+j[f  ]. 

The  last  row  in  Table  1  gives,  for  each  step,  the  number  of  systolic  cycles  required  for 
the  completion  of  the  step.  In  order  to  simplify  the  table,  it  is  assumed  that  each  multiply, 
add.  or  divide  operation  terminates  in  one  cycle.  That  is  p*  -  p*  -  p*  -  1.  By  adding  the 
execution  times  of  the  five  steps,  we  conclude  that  PCCG  iterations  may  be  executed  at  a 
rate  of  one  iteration  every  4(n  +rr)+251+10  systolic  cycles.  Note  that,  every  systolic  cell 
in  the  system  is  doing  useful  work  in  almost  every  cycle.  That  is.  for  large  n  ,  the  utiliza¬ 
tion  of  the  system  is  almost  100  percent. 

Using  the  WARP  for  the  PCCG  system 

The  Warp  is  a  10  cell  systolic  machine  which  have  been  developed  at  Carnegie  Mellon 
University  [l].  Each  cell  has  a  pipelined  multiplier  and  a  pipelined  adder,  with  5  stages 
each.  In  its  current  form,  the  Warp  may  not  be  used  to  implement  the  PCCG  system 
described  above,  namely  because  it  does  only  allow  for  a  homogeneous  mode  of  operation. 
That  is  all  the  cells  should  execute  the  same  program.  This  requirement  will  be  eliminated 
in  the  next  version  of  the  machine,  namely  the  PC  Warp. 

The  matrix/vector  multiplication  algorithm  of  Section  3  was  implemented  on  the 
current  Warp  in  order  to  test  the  suitability  of  Warp-like  machines  to  sparse  matrix  mani¬ 
pulation.  The  W2  programming  language  [3]  was  employed  and  the  W2  compiler  was  used 
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to  generate  the  micro-code  for  the  individual  cells.  Although,  the  computational  results 
were  correct,  the  timing  results  reported  by  the  compiler  indicates  that  the  number  of 
cycles  required  for  execution  is  almost  twelve  times  the  theoretical  number  of  cycles 
obtained  in  Section  3.  This  inefficiency  is  attributed  to  the  following  reasons: 

1)  In  the  current  Warp,  only  one  read  and  one  write  operation  to/from  local  memory 
may  be  performed  in  any  given  cycle.  In  our  algorithm,  one  of  the  multiplication 
operands  and  one  of  the  addition  operands  have  to  be  read  from  the  local  memory. 
Hence  addition  and  multiplication  may  not  be  initiated  in  the  same  cycle. 

2)  The  current  Warp  does  not  allow  in-cell  integer  arithmetic.  Hence  any  address  genera¬ 
tion  required  for  array  access  (even  tho'»e  of  the  type  base  +offset )  has  to  be  done 
using  the  floating  point  addition  unit.  This  source  of  inefficiency  should  disappear  in 
the  new  PC  Warp  which  includes  an  in-cell  unit  for  integer  arithmetic. 

3)  Inefficiency  of  indirect  addressing:  Although  indirect  addressing  is  supported  by  the 
Warp  hardware,  the  task  of  managing  this  indirect  mode  by  the  compiler  is  difficult, 
especially  when  index  arithmetic  is  performed  in  floating  point  with  the  result  avail¬ 
able  after  5  cycles. 

6.  CONCLUSION 

A  complete  systolic  algorithm  for  the  the  solution  of  large  sparse  linear  systems  of 
equations  is  described.  The  algorithm  may  be  implemented  on  a  linear  array  of  systolic 
cells  attached  to  a  host  computer.  Data  flows  regularly  in  the  array  and  computation  is  dis¬ 
tributed  uniformly  among  its  cells,  thus  leading  to  an  almost  perfect  utilization  of  the 
resources.  This  perfect  utilization,  however,  could  not  be  achieved  when  the  CMU  Warp 
systolic  machine  [l]  was  used  to  implement  parts  of  the  system,  namely  because  the  cells  in 
the  Warp  lack  some  of  the  basic  capabilities  that  we  assumed  in  our  systolic  array.  Most  of 
the  lacking  capabilities  will  be  incorporated  in  future  versions  of  the  Warp,  thus  allowing 
for  the  implementation  of  an  efficient  solver  for  large  sparse  linear  systems. 
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